Notes for the Improvement of a Remote Sensing Multispectral
Data Non-Supervised Classification and Mapping Technique

I. INTRODUCTION

A. Background 

Any adequate analysis of flight data statistics from remote 
sensing multispectral scanners leads toward a computational burden 
so extensive for repeated and extended scenes of the earth from orbit 
that we endeavor to adapt or develop and perfect an effective algorithm 
for the automatic articulation of a scene. To be adequate, such a method 
must both be effective, or thorough in that it articulates a scene 
accurately, and be efficient, or economical in that its computational 
requirements are reasonably within the state-of-the-art. The sought 
technique is a non-supervised classification and mapping technique to the 
extent that it should achieve articulation of the scene independently of 
any other information or training area. The interpretation of the variously
articulated and correspondingly mapped characteristics of a scene of
interest would be obvious only to a limited extent, and their complete
identification would largely require comparison with ground truth
information. However, the advantages of being able to complete so much of
the analysis automatically, with a considerable compression of data, should
be obvious. Consequently, it has not gone without notice that mappings of
the spectrally dependent articulations per se, and of their easily followed
derivatives, may play the most fundamental role in any analysis for change
detection. Probably the retrieval of automatic change detection from
multispectral scanner data must presuppose an adequate algorithm for the
non-supervised automatic articulation of the undoubtedly obscure spectral
functions of the data from repetitive overflights of a scene of interest or
from scene to scene, as the case may be. This follows from the mentioned
functions being obscure and inconstant: when other conditions change, they
change even in the absence of any changes in the characteristics of
interest. This means that the signature of an item of interest may vary
inadvertently, seemingly imposing insatiable demands for ground truth data
in order to re-calibrate the signatures before data classification can be
continued in those techniques which require supervision. By contrast, the
unsupervised techniques which we pursue adjust automatically to any changes
in the signatures.

B. Present Situation

During recent months two different algorithms for the subject technique
have been documented: (1) Su's model (Reference 1), called "Sequential
Clustering," and (2) Jayroe's model (Reference 2), "Spatial and Spectral
Clustering." Each of the authors (References 1 and 2), using samples of
data, gave sufficient results to prove that his model separately
constitutes a major accomplishment. Each model works; yet, the two models
are quite different. Therefore, any immediate attempt to combine the two
models before they are fully developed and better understood might be
deleterious to their collective potential.

C. Opinion and Purpose 

It does not seem prudent at this time to favor one of the models,
in their present forms, over the other or to decide which one of them
has the better potential. Consequently, to minimize comparison between the
two models at this time, this note will not further review Reference 2.
The purpose of this critical review is to find any parts of the model of
Reference 1 for which there may be a theoretical basis for a revision which
might improve its effectiveness without sacrificing computational
efficiency. The present model (Reference 1) had the benefit of adjustments
after experience with data. Similarly, the considerable further revision
based on theoretical considerations given in this note should benefit when
its parameter adjustments are fine tuned through experimentation with data.


II. DISCUSSION 

A. Description 

Anyone wanting a general account of the "Sequential Clustering" model
more brief than the one its developer gave (Reference 1) will find a very
helpful coverage of its principles and operation given by Krause and
Frederick (Reference 3). They identify the sequential variance analysis as
the key to Su's work (Reference 1), and they note that it was originally
developed by Krause, Jones, and Fisher (Reference 4) to detect periods of
stationary behavior in time series. However, the least-squares derivations
of the sequential variance formulas based on modes of chi-square are those
which were given by Su and Krause (Reference 5). Possible improvements to
those key formulations will be suggested in Section II. B. of this note.

The sequential variance analysis is used in Su's algorithm (Reference 1)
to test whether scan line segments are homogeneous and to test which line
segments should be merged in the initial spectroscopic classification.
Preprocessing, which depends on the type of data and the objectives of the
analysis, may be necessary for higher accuracy. The first pass with the
data establishes the measures of the classes into which the data will be
classified on the second pass. Iteration may be necessary because the data,
once put into the classes, themselves give better measures of the classes
than those which could be found in the first pass. Anything which can be
done to increase the accuracy of the sequential statistical tests should
both (1) reduce the amount of iterative computation necessary to give the
best results, and (2) ultimately give better results.

B. Statistical Sequential Clustering

1. Establishing New Classes

Equation (2-4) of Reference 1 shows that a set of $M \ge 6$ resolution
elements is considered to be a homogeneous set of samples from a new
population or class when the M points which represent them in the
hyperspace for K spectral channels are such that the squares of the ratios
of their distances from their mean to the distance from the origin to their
mean are all $\le T^2$, where $T^2$ "is some threshold value to be given."
No reasons were offered and no discussion was given to show whether or not
the value to be used for $T^2$ should depend on M. Also, no reasons were
given for using the mean distance to normalize the distances from the mean.
The criterion seems somewhat discordant, opposite from what one would have
expected; e.g., haze increases the albedo of the atmosphere while lowering
the ground level illumination, and both effects reduce contrast in ground
level images. Yet, the cited criterion says that when the reflected
illumination is high, then the difference between different classes must
also be higher in order for such difference to be accepted as meaningful.
Instead of normalizing by the mean, one should normalize each
of the K components of the deviation from the mean by the sample estimate
of the standard deviation $s_k$ in that same dimension, where, following
equation (2-1) of Reference 1,

$$ s_k = \left[ \frac{1}{M} \sum_{i=1}^{M} (x_{ki} - \bar{x}_k)^2 \right]^{1/2} ; \tag{1} $$

the unbiased estimate $[M/(M-1)]^{1/2} s_k$ is not used herein for the
further derivations. Then, equation (2-2) of Reference 1 would be replaced by

$$ \Delta x_i^2 = \sum_{k=1}^{K} \left( \frac{x_{ki} - \bar{x}_k}{s_k} \right)^2 . \tag{2} $$

Besides deleting the denominator in equation (2-4) of Reference 1,
there is a further consideration of the relation between $T^2$ and
$M \ge 6$. Because the coordinates $x_k$ are proportional to spectral
radiant intensity in channel k (see page 2-4 of Reference 1) they can have
only positive values and therefore cannot quite have normal distributions.
Nevertheless, from a hypothetical spherical joint normal distribution in
the K dimensions one can approximate roughly the relation which one might
reasonably expect between $T^2$ and M. First, consider the case where the
population mean $\mu$ and variance $\sigma^2$, per dimension, are known or
where M is large enough for their accurate determination. Then, for a
K-dimensional spherical distribution (meaning zero covariances), the ratio
of the square of the resultant distance $d^2$ from the mean to the variance
$\sigma^2$ per dimension has a chi-square distribution with K degrees of
freedom; the expected value is K.
Let P be the right side area for $\chi^2_P$; i.e., for an individual sample
the probability is 1 - P that

$$ d^2 / \sigma^2 \le \chi^2_P . \tag{3} $$

For the set of M samples from the hypothetical spherical distribution
(where one tests the hypothesis that the consecutive samples are random
observations of the same population) the probability $P_0$ that none of
them fails to satisfy equation (3) is

$$ P_0 = (1 - P)^M . \tag{4} $$

If one or more of them fail inadvertently to satisfy equation (3), then
the first sample is discarded, the probability $P_1$ for which is

$$ P_1 = 1 - (1 - P)^M . \tag{5} $$

However, because each discarded sample is replaced with another sample,
one finds essentially that each sample has the same probability of being
discarded; that is, $P_1$ in equation (5) is also the fraction of samples
which are discarded. When PM is numerically much smaller than unity, the
right side of equation (5) is approximated very well by PM, giving for
the area (right side) index P in equation (3)

$$ P \approx P_1 / M . \tag{6} $$
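
As a quick numerical check of how good the approximation in equation (6)
is, the sketch below compares it with the exact inversion of equation (5).
The function name and its arguments are hypothetical, chosen only for this
illustration.

```python
def per_sample_tail_area(p1, m, exact=False):
    """Per-sample right-side area P for a set-level discard fraction P1.

    exact=False: the approximation of equation (6), P = P1 / M.
    exact=True:  equation (5) solved for P, P = 1 - (1 - P1)**(1 / M).
    """
    if not 0.0 < p1 < 1.0 or m < 1:
        raise ValueError("need 0 < p1 < 1 and m >= 1")
    if exact:
        return 1.0 - (1.0 - p1) ** (1.0 / m)
    return p1 / m


# Compare the two forms for a six percent discard fraction.
for m in (6, 20, 100):
    approx = per_sample_tail_area(0.06, m)
    exact = per_sample_tail_area(0.06, m, exact=True)
    print(m, approx, exact)
```

For a small discard fraction the two values differ only by a few percent
of P, which supports using the simpler form P = P1 / M.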
Consider now another case of a hypothetical distribution which
is known to be normal in K statistically independent dimensions. Let it
be just coincidental that the distribution is spherical, and let the
coordinate means and variances be estimated from M random samples. One
wants, then, to establish population probabilities for $\Delta x^2$, the
square of the displacement from the centroid of the M samples, when its
sample values $\Delta x_i^2$ are given by equation (2). It was given that
$x_k$ is normally distributed with mean $\mu_k$ and variance $\sigma_k^2$.
Then consider

$$ \Delta x^2 = \sum_{k=1}^{K} r_k^2 \tag{7} $$

where

$$ r_k = (x_k - \bar{x}_k) / s_k . \tag{8} $$

The variable $r_k$ in equations (7) and (8) has a one-dimensional r
distribution with M - 2 degrees of freedom, for which tables are given in
Reference 6. With summations for the M samples one finds for $r_k$ that the
sample estimates of the mean and variance are zero and unity, respectively.
The table of areas of $|r_k|$ shows, for example, that the 99 percentile of
$|r_k|$ is an increasing function of M, giving, to the limit of the table,

$$ 2.051 < |r| < 2.556 \quad \text{for} \quad 6 < M < 122 . \tag{9} $$
The same percentile for the standard normal distribution is 2.576. The
parameter $r^2$ will be used in the derivations in Section II. C. for the
sum of F parameters, but in the present section one will not try to develop
the distribution of $\Delta x^2$ in equation (7) as a K-dimensional F
distribution; instead, one approximates $(x_k - \bar{x}_k)/s_k$ by a normal
variable $(x_k - \mu_k)/\sigma_k$ and hopes that agreement is sufficient to
support the approximation that each of the K variables $r_k$ in equation (7)
is normally distributed with zero mean and unit variance. This is more
particularly tempting because the approximation is used only to support the
$\chi^2$ distribution of $\Delta x^2$ in equation (7). So far as one can
assume that the data in any of the K channels are statistically independent
of the data in any other channel, it follows that the squared distance
$\Delta x^2$ in equation (7) (normalized for individual components as in
equation (8)) is $\chi^2$ distributed with K degrees of freedom. Because,
in this example, approximately the same value was found for the variance
in any channel, it follows that $\Delta x^2$ in equation (7) is the same
ratio as in equation (3); then equations (3) through (6) apply in this case
also. More rigorous tests are developed in Section II. C. considering
correlation, etc.

The examples just considered are unnecessarily restrictive; it still
follows that $\Delta x^2$ in equation (7) has the $\chi^2$ distribution with
K degrees of freedom regardless of whether or not the variances in the
different channels are quite different. This follows because the r
distribution for the $r_k$ in equations (7) and (8) is independent of the
mean and variance of any $x_k$. It is anticipated that the extent to which
the multispectral scanner data will be non-normal will have only negligible
effect on the end results just given. One does anticipate, though, that
non-vanishing cross-channel correlation may have a practical effect in that
the number of degrees of freedom in the $\chi^2$ distribution of
$\Delta x^2$ in equation (7) may be effectively somewhat less than K.
Otherwise, instead of equation (2-4) of Reference 1, one would require all
the $\Delta x_i^2$ calculated for the set of M samples by equation (2) to
satisfy the following criterion, with K degrees of freedom for $\chi^2$:

$$ \Delta x_i^2 < \chi^2 (P_1 / M) \tag{10} $$

where $(P_1/M)$ is the right side area for $\chi^2$ and $P_1$ is the average
fraction of samples which one is willing to discard inadvertently before
deciding that a homogeneous population is being sampled (new class). For
example, if K is 4 and M is 6, and if one prefers not to discard more than
six percent of the samples when they are homogeneous, then $P_1/M$ is 0.01
and all of the $\Delta x_i^2$ must be less than 13.3; the average or
expected value would be K or 4 and the mode or most frequently occurring
value would be K - 2 or 2.
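
The criterion of equation (10) can be sketched in a few lines. The code
below is only an illustration under stated assumptions: the function names
are hypothetical, the chi-square right side area is evaluated from the
standard finite-series forms for integer degrees of freedom, and the
threshold is found by simple bisection rather than from tables.

```python
import math


def chi2_upper_tail(x, k):
    """Right side area of chi-square with k (integer) degrees of freedom."""
    if x <= 0.0:
        return 1.0
    if k % 2 == 0:
        # Even k: exp(-x/2) * sum_{r=0}^{k/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, k // 2):
            term *= x / (2.0 * r)
            total += term
        return math.exp(-0.5 * x) * total
    # Odd k: normal two-sided tail plus a finite series in chi = sqrt(x).
    chi = math.sqrt(x)
    density = math.exp(-0.5 * x) / math.sqrt(2.0 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (k - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * density * total


def chi2_threshold(tail_area, k):
    """Value of chi-square whose right side area equals tail_area."""
    lo, hi = 0.0, 1.0
    while chi2_upper_tail(hi, k) > tail_area:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_upper_tail(mid, k) > tail_area:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def is_homogeneous(samples, p1):
    """Apply equation (10) to M samples, each a list of K channel values."""
    m, k = len(samples), len(samples[0])
    means = [sum(p[c] for p in samples) / m for c in range(k)]
    # Equation (1): the 1/M (biased) standard deviation in each channel.
    sds = [math.sqrt(sum((p[c] - means[c]) ** 2 for p in samples) / m)
           for c in range(k)]
    if any(s == 0.0 for s in sds):
        raise ValueError("a channel has zero spread; r_k is undefined")
    limit = chi2_threshold(p1 / m, k)
    # Equation (2): squared normalized displacement of every sample.
    return all(
        sum(((p[c] - means[c]) / sds[c]) ** 2 for c in range(k)) < limit
        for p in samples
    )
```

With K = 4, M = 6, and a six percent discard fraction,
`chi2_threshold(0.06 / 6, 4)` comes out close to the 13.3 quoted in the
text.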

The fraction of inadvertent rejects $P_1$ in equations (5), (6),
and (10) is one type of risk, say "producer's risk." There is also a
"consumer's risk," the fraction of samples which should have been rejected
but which are inadvertently included in a new class. The three parameters
$P_1$, $P_2$, and M are approximately related not only by


1ST ^V M 


(ID 
which follows from equations (6) and (10), but also by


dP, 


dCPj/M) 


( 12 ) 


which follows from geometrical considerations of neighboring populations.
Operationally, though, one should ignore $P_2$ and should consider
empirically a parameter $\theta$ as a function of the two independent
parameters M and $P_1$, where $\theta$ is a judicious measure of the
quality and computational efficiency of the analysis. Ideally one would
like to have iso-$\theta$ contour curves plotted on a graph of M versus
$P_1$ which would be generated by practice with typical data. The results
would be used to perfect the model expressed by equation (10).

2. Merging Excessive Classes

When the number of established classes exceeds the prescribed maximum
allowable number $W_{max}$ it is necessary to combine the two classes
which are most similar. Reference 1 used the Euclidean distance between
the means of two classes as the measure of similarity for this purpose.
Instead, it is more pertinent and almost as easy to use a distance measure
in which the difference between the means in each of the K spectral
dimensions is normalized by the two-class estimate of its standard
deviation. One should, by assuming statistical independence between class i
and class j, replace equation (2-8) of Reference 1 by

$$ D_{i,j}^2 = \sum_{k=1}^{K} \left( \frac{\bar{x}_{i,k} - \bar{x}_{j,k}}{s_{i,j,k}} \right)^2 \tag{13} $$

where

$$ s_{i,j,k}^2 = \frac{s_{i,k}^2}{m_i} + \frac{s_{j,k}^2}{m_j} , $$

where $s_{i,k}^2$ is the variance in dimension k for the $m_i$ samples in
class i and $s_{j,k}^2$ is the variance in dimension k for the $m_j$ samples
in class j. One can test the hypothesis that the populations which the two
classes represent are not different beyond some level of significance or
probability $P_c$. Then, to the extent that the normalized differences in
equation (13) are approximately normally distributed with zero mean and
variance one, and to the further extent that the components in the K
different dimensions are statistically independent, the squared difference
$D_{i,j}^2$ from equation (13) has a $\chi^2$ distribution with K degrees of
freedom; i.e., the probability is $P_c$ that

$$ D_{i,j}^2 \le \chi^2 (1 - P_c) . \tag{14} $$

For example, K is 4 for a multispectral scanner with four spectral channels;
then, without interchannel correlation, it follows by equation (14) that
the expected value of $D_{i,j}^2$ from equation (13) is 4, there is only a
10 percent chance that $D_{i,j}^2$ would be as small as 1.06, and there is
even a 10 percent chance that it would be larger than 7.78.
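
The merging rule can be sketched as follows, reading equation (13) as the
sum over channels of the squared difference of class means divided by the
two-class variance s_i^2/m_i + s_j^2/m_j. The function names and the
(means, variances, size) tuple layout are assumptions made only for this
illustration.

```python
from itertools import combinations


def merge_distance(mean_i, var_i, m_i, mean_j, var_j, m_j):
    """Squared normalized distance between two class means, equation (13)."""
    d2 = 0.0
    for xi, vi, xj, vj in zip(mean_i, var_i, mean_j, var_j):
        # Two-class estimate of the variance of the difference of the means.
        s2 = vi / m_i + vj / m_j
        d2 += (xi - xj) ** 2 / s2
    return d2


def most_similar_pair(classes):
    """Return (d2, i, j) for the pair of classes with the smallest distance.

    classes: list of (means, variances, size) tuples, one per class.
    """
    best = None
    for (i, a), (j, b) in combinations(enumerate(classes), 2):
        d2 = merge_distance(a[0], a[1], a[2], b[0], b[1], b[2])
        if best is None or d2 < best[0]:
            best = (d2, i, j)
    return best
```

When the number of classes exceeds the allowed maximum, the pair returned
by `most_similar_pair` is the candidate for merging; its distance can then
be compared with the chi-square values quoted above for K = 4.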

3. Classifying New Samples Into Established Classes 
a. Tests Being Used 

In the statistical sequential clustering method which is 
used in Reference 1 each new sample is checked (to see to which one, if 
any, of the established classes it should belong) by a series of two tests. 




The first test is a sequential test of the variance, to restrict its increase 
or decrease. Among the classes which are compatible with the new sample 
by the sequential variance test, the second test assigns the sample to the 
class for which the normalized distance from the mean is the least. Regard- 
less of other changes which seem to be needed, reversing the order of two 
such tests would seem to be an improvement. When the a priori assumption is
that the different classes represent populations which may have equally 
likely membership, then, one might test to see which classes, if any, are 
such that the normalized distances from the new sample to the means of 
the classes have reasonable values. Then, instead of choosing the smallest 
one of those values, one might prefer to consider, say, the smallest three 
values and use either a sequential variance test or a sequential mean test 
to find which one of the three classes would most nearly continue its 
sequence in the way which is in best agreement with the particular order of 
the compilation of the class. 

b. Mean Versus Mode Estimators 

In his sequential variance test, Su (Reference 1) continued as Su and
Krause (Reference 5) had done by, beginning with equation (2-15), using the
mode of $\chi^2$ ("most probable value") instead of the mean (expected
value); thus, the factor $(m - 3)/(m - 1)$ in equations (2-16) and (2-18)
of Reference 1 and in equations (5) and (7) of Reference 5 is spurious and
undoubtedly must bias the result considerably. Also, if the mean had been
used instead of the mode, then the sequence (see equations (2-15), (2-17),
and (2-18) of Reference 1) could have started with the second sample
instead of the fourth.
c. Variance Versus Standard Deviation

Another discrepancy of unknown consequence in the sequential variance
test in Reference 1 was continued from Su and Krause (Reference 5): the
assumption that an appropriate estimator for standard deviation is the
square root of the corresponding estimator for variance. This may be a
reason for their having used the mode instead of the mean as a basis for
the sequential analysis. Nevertheless, equation (2-14) in Reference 1 is a
correct beginning for the derivation of a sequential variance test:

$$ m s^2 / \sigma^2 = \chi^2 \tag{15} $$

where m is the number of samples in a class being checked, including the
prospective member as the last member where the sequence of compilation is
preserved, and where some subscripts for channel number k, etc. are
temporarily dropped for brevity. Instead of equation (2-15), the mean of
$\chi^2$ of m - 1 degrees of freedom is

$$ E[\chi^2] = m - 1 \tag{16} $$

and, instead of equation (2-16), the mean of $s_j^2$ is, by equations (15)
and (16),

$$ E[s_j^2] = (j - 1) \sigma^2 / j \quad \text{for } j = 2, 3, \ldots, m . \tag{17} $$
d. Normalization: Standard Deviation Versus Mean

Let $\sigma_{s_j^2}$ be the standard deviation of $s_j^2$; then, by
Reference 6, it is

$$ \sigma_{s_j^2} = \sigma^2 [2 (j - 1)]^{1/2} / j . \tag{18} $$

The question at this point is whether, in the least squares summation as in 
equation (2-17) of Reference 1, the deviations from the mean should be 
normalized or not, and if so, with what? In References 1 and 5 the devia- 
tions from the mean were normalized by the mean, in equation (2-17) 
similarly as in equation (2-4) of Reference 1. It seems necessary to 
normalize the differences by the standard deviation in equation (18) so 
that the least squares determination is not dominated by a few of the most 
uncertain values. Then equation (2-17) in Reference 1 should be replaced by 



Then, where $\hat{\sigma}^2$ is the sequential estimator of $\sigma^2$ which
makes equation (20) vanish, equation (2-18) of Reference 1 should be
replaced by

$$ \hat{\sigma}^2 = \sum_{j=2}^{m} \frac{j^2}{j - 1} s_j^4 \Big/ \sum_{j=2}^{m} j s_j^2 . \tag{21} $$
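
To make the behavior of this estimator concrete, the sketch below reads
equation (21) as the ratio of the sum of j^2 s_j^4 / (j - 1) to the sum of
j s_j^2 over j = 2, ..., m, which is what results from minimizing the
squared deviations of s_j^2 from its mean, each normalized by the standard
deviation of equation (18). That reading, and the function name, are
assumptions of this illustration.

```python
def sequential_variance(xs):
    """Least-squares sequential variance estimate, equation (21) as read here.

    xs is the class in its order of compilation; s_j^2 is the 1/j (biased)
    variance of the first j members.
    """
    if len(xs) < 2:
        raise ValueError("need at least two samples")
    num = den = 0.0
    total = total_sq = 0.0
    for j, x in enumerate(xs, start=1):
        total += x
        total_sq += x * x
        if j < 2:
            continue
        mean_j = total / j
        # Running-sum form; adequate for a sketch, not for ill-scaled data.
        s2_j = total_sq / j - mean_j ** 2
        num += j * j / (j - 1) * s2_j ** 2
        den += j * s2_j
    if den == 0.0:
        raise ValueError("all samples are identical")
    return num / den


# The same three values in two different orders give different estimates.
print(sequential_variance([0.0, 2.0, 1.0]))
print(sequential_variance([0.0, 1.0, 2.0]))
```

The dependence on the order of compilation seen here is the property of
the sequential family that the following sections examine.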


e. The F Distribution Versus Chi-Square

Su and Krause (Reference 5) gave the same distribution, $\chi^2$ with
m - 1 degrees of freedom, for $m s^2 / \hat{\sigma}^2$ as they had correctly
given for $m s^2 / \sigma^2$, and Reference 1 continued that presumption in
its equation (2-19). It would be difficult to establish the distribution of
$\hat{\sigma}^2$ in equation (21) from basic principles. That sequential
estimator of the variance is the ratio of the sums of two series, but the
corresponding terms in the two series not only are not statistically
independent of each other but also are not statistically independent of
preceding terms. The relations are very involved; however, it seems likely
that the distribution of $m \hat{\sigma}^2 / \sigma^2$ might not be
appreciably different from that of $m s^2 / \sigma^2$. Then, although it
follows from the normal assumption for x that $m s^2 / \sigma^2$ has a
$\chi^2$ distribution with m - 1 degrees of freedom, it reasonably can be
suspected that $m \hat{\sigma}^2 / \sigma^2$ may have also approximately the
$\chi^2$ distribution with m - 1 degrees of freedom. Therefore, their ratio
$s^2 / \hat{\sigma}^2$ probably could have nearly an F distribution with
m - 1 and m - 1 degrees of freedom or not, depending on whether the
correlation is low enough. Therefore, while the correlation has not been
evaluated either theoretically or by Monte Carlo experiment, the
distribution of the ratio is quite problematical, and using the chi-square
limits in equations (2-19) and (2-20) of Reference 1 (and in equation (8)
of Reference 5) is quite arbitrary and is not known to relate to the stated
percentage of significance. Some further analysis to illustrate the nature
of sequential tests is given in Section II. B. 4. herein.
f. Replacing Several Tests with Similar Tests

It will be shown in Section II. B. 4. that the kind of sequential test
which is used in Reference 1 (to give a least-squares estimator of the
variance) gives an estimator which differs from the one commonly used, the
maximum likelihood estimator, in that the weight given to a member of a
given sequence depends on its position in the sequence. In looking for ways
to reduce the burden of computations, which sometimes increases as
refinements are added, it will be shown in Section II. B. 4. that it is
prudent tentatively to abandon sequential tests, for their use is not
likely to be a reason for the effectiveness of the method which has been
demonstrated in Reference 1. It seems likely, too, that the number of tests
should be reduced. Instead of having two separate tests to classify a new
sample, a sequential test of the variance and a non-sequential test of the
mean (called $\chi^2$-test and N-test in Reference 1), it seems preferable
to replace those two tests with one non-sequential test of the deviation
from the population mean. This test will be developed in Section II. C.
from Student's t distribution. Reasons why the same test, or a similar one,
should also be used not only to replace the one to establish new classes
but also to replace the one to merge excessive classes will also be given
in Section II. D.

4. Nature of Least-Squares Sequential Tests

In a class of m samples, including the prospective member of the class,
which are considered to be random observations $x_j$ from a homogeneous
normal population of observations of a characteristic scene in a given
spectral channel, the maximum likelihood estimator $\bar{x}$ for the unknown
mean $\mu$ of the population is

$$ \bar{x} = (1/m) \sum_{j=1}^{m} x_j . \tag{22} $$

Equation (22) shows that $\bar{x}$ is a random variable, a function of the
random observations $x_j$, with a value $\bar{x}_j$ for each
serially-increasing sub-set j of m. The expected value and standard
deviation of $\bar{x}_j$ are $\mu$ and $\sigma / \sqrt{j}$, respectively,
where $\sigma$ is the unknown standard deviation of the population x being
sampled. The sum F of the squares of the normalized differences between
$\bar{x}_j$ and $\mu$ is

$$ F(\mu, \sigma^2) = \sum_{j=1}^{m} j (\bar{x}_j - \mu)^2 / \sigma^2 \tag{23} $$

for which the partial derivative with respect to $\mu$ is

$$ \frac{\partial F}{\partial \mu} = (2/\sigma^2) \sum_{j=1}^{m} -j (\bar{x}_j - \mu) = (2/\sigma^2) \left[ \mu \sum_{j=1}^{m} j - \sum_{j=1}^{m} j \bar{x}_j \right] . \tag{24} $$


Let $\hat{\mu}$ be the sequential estimator of $\mu$ such that its value for
$\mu$ makes equation (24) vanish; then

$$ \hat{\mu} = \sum_{j=1}^{m} j \bar{x}_j \Big/ \sum_{j=1}^{m} j = \frac{x_1 + (x_1 + x_2) + (x_1 + x_2 + x_3) + \cdots + (x_1 + x_2 + \cdots + x_m)}{\sum_{j=1}^{m} j} \tag{25} $$

$$ = \frac{x_m + 2 x_{m-1} + 3 x_{m-2} + \cdots + m x_1}{\sum_{j=1}^{m} j} . \tag{26} $$

Because of the statistical independence of the observations, it follows
from equation (26) that the mean of $\hat{\mu}$ is the population mean
$\mu$ and that the variance $\sigma_{\hat{\mu}}^2$ is

$$ \sigma_{\hat{\mu}}^2 = \frac{2 (2m + 1) \sigma^2}{3 m (m + 1)} . \tag{27} $$

Thus, $\bar{x}$ and $\hat{\mu}$, the two estimators of $\mu$, have the same
mean, and the ratio of their variances is

$$ \sigma_{\hat{\mu}}^2 / \sigma_{\bar{x}}^2 = \frac{2 (2m + 1)}{3 (m + 1)} \tag{28} $$

which increases asymptotically from one toward 4/3 as m increases from
one. In considering $\hat{\mu}$ in equation (26) as a random variable, one
does, of course, imply that the specific observations x are to be replaced
by not-yet-made observations, that they are a set of statistically
independent normal variables, each with the same mean $\mu$ and variance
$\sigma^2$. Thus, both $\bar{x}$ in equation (22) and $\hat{\mu}$ in
equation (26) are linear functions of the same set of statistically
independent normal variables, so they are also both normal and somewhat
correlated.

The correlation coefficient $\rho$ of $\bar{x}$ and $\hat{\mu}$ is related
to the covariance $\lambda$, involving expected values E[ ], by

$$ \rho = \lambda / \sigma_{\bar{x}} \sigma_{\hat{\mu}} \tag{29} $$

$$ \lambda = E[(\bar{x} - \mu)(\hat{\mu} - \mu)] $$

$$ = E \left\{ \frac{1}{m} [(x_1 - \mu) + (x_2 - \mu) + \cdots] \, [(x_m - \mu) + 2 (x_{m-1} - \mu) + \cdots] \Big/ \sum_{j=1}^{m} j \right\} $$

$$ = E[(x_m - \mu)^2 + 2 (x_{m-1} - \mu)^2 + \cdots] \Big/ m \sum_{j=1}^{m} j $$

$$ = E[(x - \mu)^2] / m = \sigma^2 / m \tag{30} $$

$$ \sigma_{\bar{x}} = \sigma / \sqrt{m} . \tag{31} $$

Then, by equations (27), (30), and (31), it follows that the correlation
coefficient $\rho$ in equation (29) is

$$ \rho = \left[ \frac{3 (m + 1)}{2 (2m + 1)} \right]^{1/2} \tag{32} $$

which decreases from a maximum value of one to an asymptotic value of 0.87
as m increases from one.
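
The closed forms in equations (28) and (32) can be checked directly from
the weights which equation (26) places on the observations. The sketch
below does that with unit population variance; the function names are
hypothetical and serve only this illustration.

```python
import math


def sequential_mean_weights(m):
    """Weights on x_1, ..., x_m in the sequential estimator, equation (26)."""
    total = m * (m + 1) / 2
    return [(m + 1 - i) / total for i in range(1, m + 1)]


def variance_ratio(m):
    """Variance of the sequential estimator over the variance of the mean."""
    w = sequential_mean_weights(m)
    # Var(estimator) = sigma^2 * sum(w^2); Var(mean) = sigma^2 / m.
    return sum(wi * wi for wi in w) * m


def correlation(m):
    """Correlation between the sequential estimator and the sample mean."""
    w = sequential_mean_weights(m)
    cov = sum(wi / m for wi in w)  # in units of sigma^2; equals 1/m
    return cov / math.sqrt(sum(wi * wi for wi in w) / m)


for m in (1, 2, 6, 50, 1000):
    print(m, variance_ratio(m), correlation(m))
```

The printed values rise toward 4/3 and fall toward about 0.87,
respectively, as m grows.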

No way is evident whereby these results for $\hat{\mu}$ could be used to
construct a criterion for classification. The purpose which is served,
instead, is heuristic: to examine an estimator $\hat{\mu}$ which is simple
enough for its properties to be shown and which belongs to the
least-squares-sequential family in which $\hat{\sigma}^2$ in equation (21)
is too difficult to analyze very well. Equation (26) shows that $\hat{\mu}$
involves weighting the members of the class in an arithmetic progression
from the last to the first, and is therefore very insensitive to the last
member or prospective member. It is difficult to see what advantage, if
any, this might have. Actually the mean and variance of $\hat{\mu}$ and the
correlation between $\hat{\mu}$ and $\bar{x}$ are all invariant to reversing
the order of the weighting progression. The correlation 0.87 by equation
(32) is even higher than one might have guessed: it probably is a good
indication that all such estimators may be highly correlated with their
corresponding unbiased or maximum likelihood counterparts. If so, then
both the F distribution discussed in Section II. B. 3. e. and the $\chi^2$
distribution, which was used, are quite inappropriate for equations (2-19)
and (2-20) of Reference 1 and for equation (8) of Reference 5.

C. Classification With F Distributions

1. F Distributions for Each Channel

Because the analysis so far in this note shows that the techniques which
were used in Reference 1 to classify a new sample, to decide whether or not
it should be put in an established class, seriously lack any firm basis in
statistical theory, one now returns to develop further the technique of
equations (7) and (8) of Section II. B. 1. in order to have not only a
valid test which will serve to decide the addition of subsequent members
but also a similar test to establish a new class. After the formulation has
been developed in this section for zero correlation between channels, it
will be revised in Section II. C. 3. for correlation.

The expedient by which the same test, or a similar test, can be used both
for establishing a class and for deciding further membership in the class
is as follows: (1) a class with m members infers a population for which
$\Delta x^2$ in equation (7) has a consequent distribution with limits
which the prospective next member is required to satisfy, but (2) in
checking for a new class each of the M prospective members is checked
against possibly other limits for the same distribution which they
collectively infer for a further prospective member. Thus, the two tests
differ in only two respects: on the one hand, the variance of
$x - \bar{x}$ in equation (8) is different, because x and $\bar{x}$ are
statistically independent only for assignment of x to a class for which the
mean has already been assessed as $\bar{x}$; on the other hand, different
fiducial limits may be used for deciding to accept the hypothesis being
tested.

The procedure which will be followed in the derivation is that $r^2$ in
equation (7) is proportional to a variable which has an F distribution, etc.

The variance of the numerator in equation (8) is

$$ \sigma_{x - \bar{x}}^2 = \sigma^2 (m \pm 1) / m \tag{33} $$

where the minus sign is used in testing for a new class and the plus sign
is used in testing a new sample for membership in an established class. So,
when the numerator of equation (8) is normalized with its standard
deviation its square has a $\chi^2$ distribution with one degree of freedom.
Also, the square of the denominator times $m / \sigma^2$ has a $\chi^2$
distribution with m - 1 degrees of freedom. The ratio of those two $\chi^2$
variables with each divided by its own degrees of freedom has an F
distribution with 1 and m - 1 degrees of freedom; i.e.,
where the choice of sign has the significance which was stated for equation
(33). In the unlikely event that correlation between channels is small
enough to be neglected, then the mean of the sum is the sum of the equal
means and is $K \mu_F$, see equation (36), and the variance of the sum is
the sum of the equal variances and is $K \sigma_F^2$, see equation (37). Of
course, these results presuppose the equal weighting for the data from all
channels as per equation (38), but if unequal weighting
$w_k / \sum_{k=1}^{K} w_k$ or $w_k / \prod_{k=1}^{K} w_k$ is wanted it has
only to be inserted in both sides of equation (38).

3 . With Inter-Channel Correlation 

Regardless of how the K parameters in equation (38) are 
correlated, the mean of the sum is the sum of the means, 


y ZF KU F 


(39) 


and the means and variances of all of the $F_k$ are independent of k. Because
all of the first partial derivatives of the right side of equation (38)
with respect to the $F_k$ are one, it follows exactly by the propagation of
error (e.g., Reference 7) that the variance is


$$\sigma_{\Sigma F}^2 = K \sigma_F^2 + 2 \sum_{k=1}^{K-1} \sum_{l=k+1}^{K} \lambda_{kl} \tag{40}$$


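where the covariance $\lambda_{kl}$ between $F_k$ and $F_l$ is, by equation (34),

$$\lambda_{kl} = E[(F_k - \mu_F)(F_l - \mu_F)] = E[F_k F_l] - \mu_F^2 \tag{41}$$

$$= \frac{1}{m} \sum_{a=1}^{m} F_{ka} F_{la} - \mu_F^2
  = \left( \frac{m-1}{m \mp 1} \right)^2 \frac{1}{s_k^2 s_l^2} \, \frac{1}{m} \sum_{a=1}^{m} [(x_{ka} - \bar{x}_k)(x_{la} - \bar{x}_l)]^2 - \mu_F^2 \tag{42}$$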
and where $\sigma_F^2$ in equation (40) and $\mu_F$ in equation (39) are functions of m
alone in equations (36) and (37). Thus, the right side of equation (38)
can be replaced by the sum of its mean from equation (39) and some constant
A times the standard deviation from taking the square root of equation (40);
i.e., the criterion is


$$\frac{m-1}{m \mp 1} \sum_{k=1}^{K} \frac{(x_k - \bar{x}_k)^2}{s_k^2} = \mu_{\Sigma F} + A \, \sigma_{\Sigma F} \tag{43}$$


The choice of sign in equations (42) and (43) , again as in equations (35) 
and (38) , is the same as that which was stated for equation (33) . 

It must not go without notice that the main computational 
burden is imposed by the necessity to compute the covariance in equation (40) . 
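
The following is a minimal sketch of how the criterion with inter-channel correlation might be evaluated for one class, assuming equations (40), (42) and (43) in the forms described above, biased (divide-by-m) variances, and nonzero variance in every channel. The names `class_parameters` and `criterion_a` are hypothetical. The double loop over channel pairs is the covariance computation identified above as the main burden.

```python
import math


def class_parameters(samples, new_sample=True):
    """Return m, means, variances, mu_sum and var_sum for one class.

    samples: list of m observations, each a list of K channel values.
    new_sample=True selects the plus sign, False the minus sign.
    """
    m = len(samples)
    if m <= 5:
        raise ValueError("need m > 5 for the F variance to be finite")
    k_channels = len(samples[0])
    means = [sum(s[k] for s in samples) / m for k in range(k_channels)]
    variances = [
        sum((s[k] - means[k]) ** 2 for s in samples) / m for k in range(k_channels)
    ]
    mu_f = (m - 1) / (m - 3)
    var_f = 2.0 * mu_f ** 2 * (m - 2) / (m - 5)
    scale = (m - 1) / ((m + 1) if new_sample else (m - 1))

    # Equation (40): K * sigma_F^2 plus twice the sum of pair covariances.
    var_sum = k_channels * var_f
    for k in range(k_channels - 1):
        for l in range(k + 1, k_channels):
            q_kl = sum(
                ((s[k] - means[k]) * (s[l] - means[l])) ** 2 for s in samples
            ) / m
            # Equation (42): sample estimate of the covariance of F_k and F_l.
            lam_kl = scale ** 2 * q_kl / (variances[k] * variances[l]) - mu_f ** 2
            var_sum += 2.0 * lam_kl
    if var_sum <= 0.0:
        raise ValueError("estimated variance of the sum is not positive")
    return m, means, variances, k_channels * mu_f, var_sum


def criterion_a(x, samples, new_sample=True):
    """Solve equation (43) for the constant A given a sample x."""
    m, means, variances, mu_sum, var_sum = class_parameters(samples, new_sample)
    scale = (m - 1) / ((m + 1) if new_sample else (m - 1))
    sum_f = scale * sum(
        (xk - mk) ** 2 / vk for xk, mk, vk in zip(x, means, variances)
    )
    return (sum_f - mu_sum) / math.sqrt(var_sum)
```

A call such as `criterion_a([2.0, 0.0], samples)` returns the number of standard deviations by which the summed statistic for the sample exceeds its expected value; a small or negative value favors membership.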

D. Merging by F Distributions 

In Section II.B.2 the squared distance between the empirical
centroids of two classes, equation (13), was found by ignoring any
inter-channel correlation and by using the normal approximation to the
components for which the sum of the squares is $\chi^2$ distributed. It will
now be shown, without making those approximations, how to transform
equation (13) into a sum of parameters which each have an F distribution.

First consider only one component and temporarily drop the channel
subscript k. Let $\mu_i$ and $\mu_j$ be the means of the populations for classes i
and j which have size $m_i$ and $m_j$, etc. as was stated for equation (13).
Then, as shown in Reference 8,

$$t_{ij} = \frac{(\bar{x}_i - \bar{x}_j) - (\mu_i - \mu_j)}{\sqrt{m_i s_i^2 + m_j s_j^2}} \left[ \frac{m_i m_j (m_i + m_j - 2)}{m_i + m_j} \right]^{1/2} \tag{44}$$

will have Student's t distribution with $m_i + m_j - 2$ degrees of freedom.

Then, by Reference 9, $t_{ij}^2$ has an F distribution with one and $m_i + m_j - 2$
degrees of freedom; i.e.,

$$\frac{m_i m_j (m_i + m_j - 2)}{m_i + m_j} \, \frac{[(\bar{x}_i - \bar{x}_j) - (\mu_i - \mu_j)]^2}{m_i s_i^2 + m_j s_j^2} = F_{k,ij} \tag{45}$$


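where the mean $\mu_{F_{ij}}$ and variance $\sigma_{F_{ij}}^2$ of $F_{k,ij}$ are, by Reference 6,

$$\mu_{F_{ij}} = \frac{m_i + m_j - 2}{m_i + m_j - 4}, \qquad m_i + m_j > 4 \tag{46}$$

$$\sigma_{F_{ij}}^2 = 2 \mu_{F_{ij}}^2 \, \frac{m_i + m_j - 3}{m_i + m_j - 6}, \qquad m_i + m_j > 6 . \tag{47}$$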
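
As a minimal sketch of equations (45) through (47) for one channel, assuming the merging hypothesis that the two population means are equal (so the term in the population means vanishes) and biased (divide-by-m) class variances, the pair statistic and its moments might be computed as follows. The function names are illustrative only.

```python
def pair_f(mean_i, var_i, m_i, mean_j, var_j, m_j):
    """Equation (45) for one channel under the hypothesis mu_i == mu_j."""
    scale = m_i * m_j * (m_i + m_j - 2) / (m_i + m_j)
    return scale * (mean_i - mean_j) ** 2 / (m_i * var_i + m_j * var_j)


def pair_f_moments(m_i, m_j):
    """Equations (46) and (47): mean and variance of F(1, m_i + m_j - 2)."""
    n = m_i + m_j
    if n <= 6:
        raise ValueError("the variance is finite only for m_i + m_j > 6")
    mu = (n - 2) / (n - 4)
    var = 2.0 * mu ** 2 * (n - 3) / (n - 6)
    return mu, var


# Two classes of 4 and 6 members, compared in one channel.
f_value = pair_f(1.0, 1.0, 4, 3.0, 2.0, 6)
mu_pair, var_pair = pair_f_moments(4, 6)
print(f_value, mu_pair, var_pair)
```

A pair whose statistic lies close to or below the mean from equation (46) is a natural candidate for merging, since the observed centroid separation is then no larger than sampling fluctuation alone would produce.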


The K channels could be considered collectively by summing equation 
(45) just as equation (38) was given by summing equation (35); then, 
equations corresponding to equations (39) through (41) would follow by 
changing the notation 


$$\lambda_{kl,ij} \; \ldots \tag{48}$$
$$\cdots \; E[(\bar{x}_{ki} - \mu_{ki})(\bar{x}_{li} - \mu_{li}) + (\bar{x}_{kj} - \mu_{kj})(\bar{x}_{lj} - \mu_{lj})]$$

$$\cdots \; \frac{1}{m_i} \sum_{a=1}^{m_i} [(x_{kia} - \bar{x}_{ki})(x_{lia} - \bar{x}_{li})]^2 + \frac{1}{m_j} \sum_{b=1}^{m_j} [(x_{kjb} - \bar{x}_{kj})(x_{ljb} - \bar{x}_{lj})]^2$$


where equation (52) follows from equation (51) because correlation is assumed to 
be appreciable only between channels within a class and not between a given 
channel and given class and a different channel and different class. 

Whether or not such correlations might be sufficiently small to support
elegantly the computational expedient by which equation (51) is replaced
by (52) and in turn by (53) could be established by analysis of
representative data, but only relative results are needed in the test for
merging excessive classes because it is only a question of which two classes
to merge and not a question of whether or not to merge any classes.

It will be seen that the summed terms in equation (53) are the
same as those in equation (42) when they are converted to the same notation;
thus, the criterion for merging two classes does not require a separate 
computation of such summations which are already used in the criteria for 
forming new classes and classifying new samples into established populations. 


III. ALGORITHM FOR UNSUPERVISED CLASSIFICATION USING F DISTRIBUTIONS 

For each class or prospective class one needs values for the following
parameters:

$m$ = number of members in the class, $m \ge 6$

$\bar{x}_k = \frac{1}{m} \sum_{a=1}^{m} x_{ka}$, class mean in each channel $k = 1, 2, \ldots, K$

$s_k^2 = \frac{1}{m} \sum_{a=1}^{m} (x_{ka} - \bar{x}_k)^2$, class variance in each channel

$Q_{kl} = \frac{1}{m} \sum_{a=1}^{m} [(x_{ka} - \bar{x}_k)(x_{la} - \bar{x}_l)]^2$, each pair of channels k and l

$\lambda_{kl} = \left( \frac{m-1}{m \mp 1} \right)^2 Q_{kl} / (s_k^2 s_l^2) - \mu_F^2$

$\mu_F = \frac{m-1}{m-3}$, $m > 3$

$\sigma_F^2 = 2 \mu_F^2 \, \frac{m-2}{m-5}$

$\mu_{\Sigma F} = K \mu_F$

$\sigma_{\Sigma F}^2 = K \sigma_F^2 + 2 \sum_{k=1}^{K-1} \sum_{l=k+1}^{K} \lambda_{kl}$
Also, for each pair of established classes i and j containing $m_i$ and $m_j$
members one needs values for the following parameters:

$\mu_{F_{ij}} = (m_i + m_j - 2)/(m_i + m_j - 4)$

$\sigma_{F_{ij}}^2 = 2 \mu_{F_{ij}}^2 (m_i + m_j - 3)/(m_i + m_j - 6)$

$\Sigma F_{ij} = \frac{m_i m_j (m_i + m_j - 2)}{m_i + m_j} \sum_{k=1}^{K} \frac{(\bar{x}_{ki} - \bar{x}_{kj})^2}{m_i s_{ki}^2 + m_j s_{kj}^2}$

$\lambda_{kl,ij} = \left[ \frac{m_i m_j (m_i + m_j - 2)}{m_i + m_j} \right]^2 \frac{\cdots}{(m_i s_{ki}^2 + m_j s_{kj}^2)(m_i s_{li}^2 + m_j s_{lj}^2)}$

$\sigma_{\Sigma F_{ij}}^2 = K \sigma_{F_{ij}}^2 + 2 \sum_{k=1}^{K-1} \sum_{l=k+1}^{K} \lambda_{kl,ij}$

$A_{ij} = (\Sigma F_{ij} - \mu_{\Sigma F_{ij}}) / \sigma_{\Sigma F_{ij}}$
The two other formulas which are always used together, with a purpose
which depends on what datum is substituted for the parameter $x_k$, are

$$\Sigma F = \frac{m-1}{m \mp 1} \sum_{k=1}^{K} \frac{(x_k - \bar{x}_k)^2}{s_k^2}, \qquad A = \frac{\Sigma F - \mu_{\Sigma F}}{\sigma_{\Sigma F}} \tag{54}$$


The number W of retained classes must not exceed an allowable number
$W_{max}$.

Step 1. Read control parameters $A_0$, $A_1$, $M \ge 6$, and $W_{max}$.

Step 2. Read the first M samples. 

Step 3. Calculate parameters for prospective class. 

Step 4. With the $\bar{x}_k$ and $s_k^2$ from step 3, calculate a value of A in
equation (54) for each of the M samples by using the values of $x_k$ for that
particular sample in equation (54) with the minus sign. Does the largest
value of A satisfy $A < A_0$? Yes: go to step 7. No: go to step 5.

Step 5. Discard the first sample accumulated. 

Step 6. Read a new sample, then go to step 3 (recursion formulas 
may be helpful) . 

Step 7. Designate a new class having the parameters extant. 

Step 8. Does the program reach the end of the sample sequence? 

Yes: go to step 9. No: go to step 11. 

Step 9. Print out any parameters and classification map which are 
required by the Flight Data Statistics Office. 

Step 10. Stop. 

Step 11. Does the number of classes W satisfy $W \le W_{max}$? Yes: go to
step 14. No: go to step 12.
Step 12. Calculate class-pair parameters for all combinations of
classes in pairs (recursion formulas may be helpful).

Step 13. Combine the two classes i and j which give the smallest
pair-parameter $A_{ij}$ and compute the single-class parameters of the resulting
class.

Step 14. Read a new sample. 

Step 15. By using the values of $x_k$ from the new sample in equation
(54) with the plus sign, calculate a value of A for each of the W established
classes according to their given values of $\bar{x}_k$, $s_k^2$, etc. Does the smallest
one of the W values of A satisfy $A \le A_1$? Yes: add the sample to that class,
revise the parameters of that class and go to step 8. No: put the sample
in hold and go to step 16.

Step 16. Has the number of samples in hold reached M? No: go to
step 14. Yes: go to step 17.

Step 17. Calculate parameters for prospective class. 

Step 18. With $\bar{x}_k$ and $s_k^2$ from step 17, calculate a value of A in
equation (54) for each of the M samples by using the values of $x_k$ for the
particular sample in equation (54) with the minus sign. Does the largest
value of A satisfy $A < A_0$? Yes: go to step 19. No: discard the first one
of the M samples held for step 17 and go to step 14.

Step 19. Designate a new class with the parameter values which are
extant (from step 17).

Step 20. Empty the hold from step 16 and go to step 8. 
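
Several steps above note that recursion formulas may be helpful. The report does not state them in this section, so the following is only a sketch of one standard choice: the usual incremental update of the mean and of the biased (divide-by-m) variance when a sample joins a class (step 15), and the corresponding combination rule when two classes are merged (step 13). The pair sums $Q_{kl}$ would need a similar update, which is not attempted here. Function names are hypothetical.

```python
def add_sample(m, means, variances, x):
    """Update per-channel mean and biased variance after adding sample x."""
    m_new = m + 1
    new_means, new_vars = [], []
    for mean_k, var_k, x_k in zip(means, variances, x):
        delta = x_k - mean_k
        mean_new = mean_k + delta / m_new
        # Sum of squared deviations grows by delta * (x_k - mean_new).
        var_new = (m * var_k + delta * (x_k - mean_new)) / m_new
        new_means.append(mean_new)
        new_vars.append(var_new)
    return m_new, new_means, new_vars


def merge_classes(m_i, means_i, vars_i, m_j, means_j, vars_j):
    """Single-class mean and biased variance of the union of classes i and j."""
    m = m_i + m_j
    means, variances = [], []
    for mu_i, v_i, mu_j, v_j in zip(means_i, vars_i, means_j, vars_j):
        mean = (m_i * mu_i + m_j * mu_j) / m
        # Within-class scatter plus the scatter between the two centroids.
        var = (m_i * v_i + m_j * v_j) / m + m_i * m_j * (mu_i - mu_j) ** 2 / m ** 2
        means.append(mean)
        variances.append(var)
    return m, means, variances
```

Both updates avoid rereading the samples already assigned to a class, so the cost of revising a class in steps 13 and 15 is proportional to the number of channels K rather than to the class size.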